Rosetta: A Collection of Python Tools (Library) for Data Science

Posted by jopen 9 years ago | 19K reads | Tags: Rosetta, Python development

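Rosetta is a collection of Python tools (a library) that supports data science work, especially text processing. It is particularly well optimized for parallel processing and for handling large files.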

Tools for data science with a focus on text processing.

  • Focuses on "medium data", i.e. data too big to fit into memory but too small to necessitate the use of a cluster.
  • Integrates with existing scientific Python stack as well as select outside tools.
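To illustrate the "medium data" idea above, here is a minimal sketch of processing a file in fixed-size chunks so that only one chunk is held in memory at a time. It uses only the standard library and is not Rosetta's API; the names `iter_chunks` and `count_lines_and_tokens` are hypothetical.

```python
def iter_chunks(items, chunk_size):
    """Yield lists of at most `chunk_size` items from any iterable."""
    chunk = []
    for item in items:
        chunk.append(item)
        if len(chunk) == chunk_size:
            yield chunk
            chunk = []
    if chunk:  # final, possibly shorter, chunk
        yield chunk


def count_lines_and_tokens(lines, chunk_size=10000):
    """Count lines and whitespace-separated tokens, one chunk at a time."""
    n_lines = 0
    n_tokens = 0
    for chunk in iter_chunks(lines, chunk_size):
        n_lines += len(chunk)
        n_tokens += sum(len(line.split()) for line in chunk)
    return n_lines, n_tokens
```

Because an open file object is itself an iterable of lines, passing it directly (for example `with open(path) as f: count_lines_and_tokens(f)`) should keep memory use bounded by the chunk size rather than the file size.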

    Examples

    • See the examples/ directory.
    • The docs contain plots of example output.

      Packages

      cmdutils

      • Unix-like command line utilities. Filters (read from stdin/write to stdout) for files.
      • Focus on stream processing and csv files.
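As a rough illustration of the filter style described for cmdutils (read from stdin, write to stdout, stream csv rows), the sketch below selects columns from a csv stream. It is written with the standard `csv` module only; the function name `select_columns` and the command-line convention are assumptions and do not reflect Rosetta's actual utilities.

```python
import csv
import sys


def select_columns(instream, outstream, columns):
    """Copy only the named columns from a csv stream, one row at a time."""
    reader = csv.DictReader(instream)
    writer = csv.DictWriter(
        outstream,
        fieldnames=columns,
        extrasaction="ignore",  # silently drop columns that were not requested
        lineterminator="\n",
    )
    writer.writeheader()
    for row in reader:
        writer.writerow(row)


# Hypothetical usage as a filter: python select_columns.py name age < in.csv > out.csv
if __name__ == "__main__" and len(sys.argv) > 1:
    select_columns(sys.stdin, sys.stdout, sys.argv[1:])
```

Rows are read and written one by one, so a filter like this can sit in a Unix pipeline without loading the whole file.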

        parallel

        • Wrappers for Python multiprocessing that add ease of use
        • Memory-friendly multiprocessing
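One way to picture "memory-friendly multiprocessing" is a parallel map that pulls its input in bounded batches instead of materializing the whole stream. The sketch below uses the standard `multiprocessing` module; it is an illustration under that assumption, not Rosetta's wrapper, and `lazy_parallel_map` and its parameters are hypothetical names.

```python
import itertools
import multiprocessing


def lazy_parallel_map(func, iterable, n_jobs=2, batch_size=1000, start_method=None):
    """Yield func(x) for each x in iterable, in order, using worker processes.

    Only `batch_size` inputs are taken from `iterable` at a time, so a large
    stream is never held in memory all at once.
    """
    # start_method=None selects the platform default ("fork", "spawn", ...).
    ctx = multiprocessing.get_context(start_method)
    iterator = iter(iterable)
    with ctx.Pool(processes=n_jobs) as pool:
        while True:
            batch = list(itertools.islice(iterator, batch_size))
            if not batch:
                break
            # pool.map blocks until the batch is finished and preserves order.
            for result in pool.map(func, batch):
                yield result
```

The function passed in must be picklable (for example a built-in or a module-level function), and on platforms that start workers with "spawn" the call should live under an `if __name__ == "__main__":` guard.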

          text

          • Stream text from disk to formats used in common ML processes
          • Write processed text to sparse formats
          • Helpers for ML tools (e.g. Vowpal Wabbit, Gensim, etc...)
          • Other general utilities
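To make "stream text to sparse formats" concrete, the following sketch turns lines of raw text into a simplified Vowpal Wabbit style line of `token:count` features. The tokenizer, the function names, and the reduced form of the format (no namespace name, importance weight, or tag) are assumptions made for illustration; this is not Rosetta's API.

```python
import re
from collections import Counter

TOKEN_RE = re.compile(r"[a-z0-9]+")  # keeps feature names free of ':' and '|'


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return TOKEN_RE.findall(text.lower())


def to_vw_line(tokens, label=None):
    """Format token counts as one sparse 'label | tok:count ...' line."""
    counts = Counter(tokens)
    features = " ".join(f"{tok}:{n}" for tok, n in sorted(counts.items()))
    prefix = f"{label} " if label is not None else ""
    return f"{prefix}| {features}"


def stream_vw_lines(lines):
    """Lazily convert an iterable of text lines, one document per line."""
    for line in lines:
        yield to_vw_line(tokenize(line))
```

Only tokens that actually occur are written, which is what makes the representation sparse, and the generator handles one document at a time so the input can be a file too large for memory.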

            workflow

            • High-level wrappers that have helped with our workflow and provide additional examples of code use

              modeling

              • General ML modeling utilities

                Install

                Check out the master branch from the rosetta repo. Then (so long as you have pip):

                cd rosetta
                make
                make test

                If you update the source, you can do

                make reinstall
                make test

                The above make targets use pip, so you can of course do pip uninstall at any time.

                Getting the source (above) is the preferred method since the code changes often, but if you don't use Git you can download a tagged release (tarball) here. Then

                pip install rosetta-X.X.X.tar.gz

                Development

                Code

                You can get the latest sources with

                git clone git://github.com/columbia-applied-data-science/rosetta

                Contributing

                Feel free to contribute a bug report or a request by opening an issue.

                The preferred method to contribute is to fork and send a pull request. Before doing this, read CONTRIBUTING.md.

                Dependencies

                • Major dependencies on Pandas and numpy.
                • Minor dependencies on Gensim and statsmodels.
                • Some examples need scikit-learn.
                • Minor dependencies on docx.
                • Minor dependencies on the Unix utilities pdftotext and catdoc.
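A quick way to see which of the dependencies listed above are available is sketched below. The import names are assumptions (scikit-learn is assumed to import as `sklearn`, and docx as `docx`), and `check_dependencies` is a hypothetical helper, not part of Rosetta.

```python
import importlib.util
import shutil

# Assumed import names for the Python dependencies listed above.
PYTHON_MODULES = ["pandas", "numpy", "gensim", "statsmodels", "sklearn", "docx"]
# Command-line tools expected on PATH.
UNIX_TOOLS = ["pdftotext", "catdoc"]


def check_dependencies(modules=PYTHON_MODULES, tools=UNIX_TOOLS):
    """Return {name: True/False} for each module and command-line tool."""
    status = {}
    for name in modules:
        # find_spec locates the module without importing it.
        status[name] = importlib.util.find_spec(name) is not None
    for tool in tools:
        status[tool] = shutil.which(tool) is not None
    return status
```

Printing the result, for example `for name, ok in check_dependencies().items(): print(name, "ok" if ok else "missing")`, gives a short report before running the tests.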

                Testing

                From the base repo directory, rosetta/, you can run all tests with

                make test

                Project home page: http://www.baiduhome.net/lib/view/home/1422504070611
