public class ExternalSort extends Object
Modifier and Type | Field and Description |
---|---|
static Comparator<String> | defaultcomparator: Default comparator between strings. |
static int | DEFAULTMAXTEMPFILES: Default maximal number of temporary files allowed. |
Constructor and Description |
---|
ExternalSort() |
Modifier and Type | Method and Description |
---|---|
static long | estimateAvailableMemory(): Calls the garbage collector and then returns the free memory. |
static long | estimateBestSizeOfBlocks(long sizeoffile, int maxtmpfiles, long maxMemory): Estimates the best block size to use when dividing the file into small blocks. |
static void | main(String[] args) |
static int | mergeSortedFiles(BufferedWriter fbw, Comparator<String> cmp, boolean distinct, List<com.google.code.externalsorting.BinaryFileBuffer> buffers): Merges several BinaryFileBuffer objects into an output writer. |
static int | mergeSortedFiles(List<File> files, File outputfile): Merges a bunch of temporary flat files. |
static int | mergeSortedFiles(List<File> files, File outputfile, Comparator<String> cmp): Merges a bunch of temporary flat files. |
static int | mergeSortedFiles(List<File> files, File outputfile, Comparator<String> cmp, boolean distinct): Merges a bunch of temporary flat files. |
static int | mergeSortedFiles(List<File> files, File outputfile, Comparator<String> cmp, Charset cs): Merges a bunch of temporary flat files. |
static int | mergeSortedFiles(List<File> files, File outputfile, Comparator<String> cmp, Charset cs, boolean distinct): Merges a bunch of temporary flat files. |
static int | mergeSortedFiles(List<File> files, File outputfile, Comparator<String> cmp, Charset cs, boolean distinct, boolean append, boolean usegzip): Merges a bunch of temporary flat files. |
static void | sort(File input, File output): Sorts a file (input) to an output file (output) using default parameters. |
static File | sortAndSave(List<String> tmplist, Comparator<String> cmp, Charset cs, File tmpdirectory): Sorts a list and saves it to a temporary file. |
static File | sortAndSave(List<String> tmplist, Comparator<String> cmp, Charset cs, File tmpdirectory, boolean distinct, boolean usegzip, boolean parallel): Sorts a list and saves it to a temporary file. |
static List<File> | sortInBatch(BufferedReader fbr, long datalength): Loads the file by blocks of lines, sorts them in memory, and writes the result to temporary files that have to be merged later. |
static List<File> | sortInBatch(BufferedReader fbr, long datalength, Comparator<String> cmp, boolean distinct): Loads the file by blocks of lines, sorts them in memory, and writes the result to temporary files that have to be merged later. |
static List<File> | sortInBatch(BufferedReader fbr, long datalength, Comparator<String> cmp, int maxtmpfiles, long maxMemory, Charset cs, File tmpdirectory, boolean distinct, int numHeader, boolean usegzip, boolean parallel): Loads the file by blocks of lines, sorts them in memory, and writes the result to temporary files that have to be merged later. |
static List<File> | sortInBatch(File file): Loads the file by blocks of lines, sorts them in memory, and writes the result to temporary files that have to be merged later. |
static List<File> | sortInBatch(File file, Comparator<String> cmp): Loads the file by blocks of lines, sorts them in memory, and writes the result to temporary files that have to be merged later. |
static List<File> | sortInBatch(File file, Comparator<String> cmp, boolean distinct): Loads the file by blocks of lines, sorts them in memory, and writes the result to temporary files that have to be merged later. |
static List<File> | sortInBatch(File file, Comparator<String> cmp, Charset cs, File tmpdirectory, boolean distinct, int numHeader): Loads the file by blocks of lines, sorts them in memory, and writes the result to temporary files that have to be merged later. |
static List<File> | sortInBatch(File file, Comparator<String> cmp, File tmpdirectory, boolean distinct, int numHeader): Loads the file by blocks of lines, sorts them in memory, and writes the result to temporary files that have to be merged later. |
static List<File> | sortInBatch(File file, Comparator<String> cmp, int maxtmpfiles, Charset cs, File tmpdirectory, boolean distinct): Loads the file by blocks of lines, sorts them in memory, and writes the result to temporary files that have to be merged later. |
static List<File> | sortInBatch(File file, Comparator<String> cmp, int maxtmpfiles, Charset cs, File tmpdirectory, boolean distinct, int numHeader): Loads the file by blocks of lines, sorts them in memory, and writes the result to temporary files that have to be merged later. |
static List<File> | sortInBatch(File file, Comparator<String> cmp, int maxtmpfiles, Charset cs, File tmpdirectory, boolean distinct, int numHeader, boolean usegzip): Loads the file by blocks of lines, sorts them in memory, and writes the result to temporary files that have to be merged later. |
static List<File> | sortInBatch(File file, Comparator<String> cmp, int maxtmpfiles, Charset cs, File tmpdirectory, boolean distinct, int numHeader, boolean usegzip, boolean parallel): Loads the file by blocks of lines, sorts them in memory, and writes the result to temporary files that have to be merged later. |
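
The summary above describes a two-phase process: sortInBatch produces sorted temporary files, and mergeSortedFiles merges them into one output. The following is a minimal sketch of that idea using only the Java standard library. It is not the library's implementation: the class TwoPhaseSortSketch, its methods sortInBlocks and mergeRuns, and the maxLines block limit are hypothetical names chosen for illustration (the real class sizes blocks from memory estimates rather than a line count).

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical illustration class; not part of the documented library.
final class TwoPhaseSortSketch {

    /** Phase 1: read blocks of at most maxLines lines, sort each block, save each to a temp file. */
    public static List<Path> sortInBlocks(BufferedReader in, int maxLines, Comparator<String> cmp)
            throws IOException {
        List<Path> runs = new ArrayList<>();
        List<String> block = new ArrayList<>();
        String line;
        while ((line = in.readLine()) != null) {
            block.add(line);
            if (block.size() >= maxLines) {
                runs.add(saveSorted(block, cmp));
                block.clear();
            }
        }
        if (!block.isEmpty()) {
            runs.add(saveSorted(block, cmp)); // last, possibly shorter, block
        }
        return runs;
    }

    private static Path saveSorted(List<String> block, Comparator<String> cmp) throws IOException {
        List<String> sorted = new ArrayList<>(block);
        sorted.sort(cmp);
        Path tmp = Files.createTempFile("sortsketch", ".txt");
        tmp.toFile().deleteOnExit();
        Files.write(tmp, sorted, StandardCharsets.UTF_8);
        return tmp;
    }

    /** One open sorted run plus its current, not yet written, line. */
    private static final class Run {
        final BufferedReader reader;
        String head;

        Run(BufferedReader reader) throws IOException {
            this.reader = reader;
            this.head = reader.readLine();
        }
    }

    /** Phase 2: k-way merge of the sorted runs; returns the number of lines written. */
    public static int mergeRuns(List<Path> runs, BufferedWriter out, Comparator<String> cmp)
            throws IOException {
        // The queue always holds the smallest unwritten line of every non-exhausted run.
        PriorityQueue<Run> queue = new PriorityQueue<>((a, b) -> cmp.compare(a.head, b.head));
        for (Path p : runs) {
            Run r = new Run(Files.newBufferedReader(p, StandardCharsets.UTF_8));
            if (r.head != null) queue.add(r); else r.reader.close();
        }
        int written = 0;
        while (!queue.isEmpty()) {
            Run r = queue.poll();
            out.write(r.head);
            out.newLine();
            written++;
            r.head = r.reader.readLine();
            if (r.head != null) queue.add(r); else r.reader.close();
        }
        out.flush();
        return written;
    }
}
```

With the documented class itself, the corresponding call sequence would be sortInBatch on the input file followed by mergeSortedFiles on the returned list, or simply sort(input, output) with default parameters.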
public static Comparator<String> defaultcomparator
public static final int DEFAULTMAXTEMPFILES
public static long estimateAvailableMemory()
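
As a rough illustration of what estimateAvailableMemory() is described as doing (call the garbage collector, then report free memory), the sketch below uses the standard Runtime API. It is an assumption, not the library's code: the class name MemoryEstimateSketch is hypothetical, and the choice to report the maximum heap size minus the memory currently in use is one plausible reading of "free memory".

```java
// Hypothetical illustration class; not part of the documented library.
final class MemoryEstimateSketch {

    /** Requests a GC pass, then returns an estimate (in bytes) of the memory still available to the JVM. */
    public static long estimateAvailableMemory() {
        System.gc(); // only a hint; the JVM may ignore it
        Runtime r = Runtime.getRuntime();
        long used = r.totalMemory() - r.freeMemory(); // bytes currently occupied in the allocated heap
        return r.maxMemory() - used;                  // assumption: the heap may still grow up to maxMemory
    }
}
```

Because System.gc() is only a request, the returned value is an estimate and can differ between consecutive calls.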
public static long estimateBestSizeOfBlocks(long sizeoffile, int maxtmpfiles, long maxMemory)
sizeoffile - how much data (in bytes) we can expect
maxtmpfiles - how many temporary files we can create (e.g., 1024)
maxMemory - maximum memory to use (in bytes)
public static void main(String[] args) throws IOException
args - command line arguments
IOException - generic IO exception
public static int mergeSortedFiles(BufferedWriter fbw, Comparator<String> cmp, boolean distinct, List<com.google.code.externalsorting.BinaryFileBuffer> buffers) throws IOException
fbw - A buffer where we write the data.
cmp - A comparator object that tells us how to sort the lines.
distinct - Pass true if duplicate lines should be discarded.
buffers - Where the data should be read from.
IOException - generic IO exception
public static int mergeSortedFiles(List<File> files, File outputfile) throws IOException
files - The List of sorted Files to be merged.
outputfile - The output File to merge the results to.
IOException - generic IO exception
public static int mergeSortedFiles(List<File> files, File outputfile, Comparator<String> cmp) throws IOException
files - The List of sorted Files to be merged.
outputfile - The output File to merge the results to.
cmp - The Comparator to use to compare Strings.
IOException - generic IO exception
public static int mergeSortedFiles(List<File> files, File outputfile, Comparator<String> cmp, boolean distinct) throws IOException
files - The List of sorted Files to be merged.
outputfile - The output File to merge the results to.
cmp - The Comparator to use to compare Strings.
distinct - Pass true if duplicate lines should be discarded.
IOException - generic IO exception
public static int mergeSortedFiles(List<File> files, File outputfile, Comparator<String> cmp, Charset cs) throws IOException
files - The List of sorted Files to be merged.
outputfile - The output File to merge the results to.
cmp - The Comparator to use to compare Strings.
cs - The Charset to be used for the byte to character conversion.
IOException - generic IO exception
public static int mergeSortedFiles(List<File> files, File outputfile, Comparator<String> cmp, Charset cs, boolean distinct) throws IOException
files - The List of sorted Files to be merged.
distinct - Pass true if duplicate lines should be discarded.
outputfile - The output File to merge the results to.
cmp - The Comparator to use to compare Strings.
cs - The Charset to be used for the byte to character conversion.
IOException - generic IO exception
public static int mergeSortedFiles(List<File> files, File outputfile, Comparator<String> cmp, Charset cs, boolean distinct, boolean append, boolean usegzip) throws IOException
files - The List of sorted Files to be merged.
distinct - Pass true if duplicate lines should be discarded.
outputfile - The output File to merge the results to.
cmp - The Comparator to use to compare Strings.
cs - The Charset to be used for the byte to character conversion.
append - Pass true if the result should be appended to the File instead of overwriting it. Defaults to false in the other overloads.
usegzip - Pass true if gzip compression was used for the temporary files.
IOException - generic IO exception
public static void sort(File input, File output) throws IOException
input - source file
output - output file
IOException - generic IO exception
public static File sortAndSave(List<String> tmplist, Comparator<String> cmp, Charset cs, File tmpdirectory) throws IOException
tmplist - data to be sorted
cmp - string comparator
cs - charset to use for output (can use Charset.defaultCharset())
tmpdirectory - location of the temporary files (set to null for default location)
IOException - generic IO exception
public static File sortAndSave(List<String> tmplist, Comparator<String> cmp, Charset cs, File tmpdirectory, boolean distinct, boolean usegzip, boolean parallel) throws IOException
tmplist - data to be sorted
cmp - string comparator
cs - charset to use for output (can use Charset.defaultCharset())
tmpdirectory - location of the temporary files (set to null for default location)
distinct - Pass true if duplicate lines should be discarded.
usegzip - set to true if you are using gzip compression for the temporary files
parallel - set to true when sorting in parallel
IOException - generic IO exception
public static List<File> sortInBatch(BufferedReader fbr, long datalength) throws IOException
fbr - data source
datalength - estimated data volume (in bytes)
IOException - generic IO exception
public static List<File> sortInBatch(BufferedReader fbr, long datalength, Comparator<String> cmp, boolean distinct) throws IOException
fbr - data source
datalength - estimated data volume (in bytes)
cmp - string comparator
distinct - Pass true if duplicate lines should be discarded.
IOException - generic IO exception
public static List<File> sortInBatch(BufferedReader fbr, long datalength, Comparator<String> cmp, int maxtmpfiles, long maxMemory, Charset cs, File tmpdirectory, boolean distinct, int numHeader, boolean usegzip, boolean parallel) throws IOException
fbr - data source
datalength - estimated data volume (in bytes)
cmp - string comparator
maxtmpfiles - maximal number of temporary files
maxMemory - maximum amount of memory to use (in bytes)
cs - character set to use (can use Charset.defaultCharset())
tmpdirectory - location of the temporary files (set to null for default location)
distinct - Pass true if duplicate lines should be discarded.
numHeader - number of leading (header) lines to exclude from sorting
usegzip - use gzip compression for the temporary files
parallel - sort in parallel
IOException - generic IO exception
public static List<File> sortInBatch(File file) throws IOException
file - some flat file
IOException - generic IO exception
public static List<File> sortInBatch(File file, Comparator<String> cmp) throws IOException
file - some flat file
cmp - string comparator
IOException - generic IO exception
public static List<File> sortInBatch(File file, Comparator<String> cmp, boolean distinct) throws IOException
file - some flat file
cmp - string comparator
distinct - Pass true if duplicate lines should be discarded.
IOException - generic IO exception
public static List<File> sortInBatch(File file, Comparator<String> cmp, File tmpdirectory, boolean distinct, int numHeader) throws IOException
file - some flat file
cmp - string comparator
tmpdirectory - location of the temporary files (set to null for default location)
distinct - Pass true if duplicate lines should be discarded.
numHeader - number of leading (header) lines to exclude from sorting
IOException - generic IO exception
public static List<File> sortInBatch(File file, Comparator<String> cmp, int maxtmpfiles, Charset cs, File tmpdirectory, boolean distinct) throws IOException
file - some flat file
cmp - string comparator
maxtmpfiles - maximal number of temporary files
cs - character set to use (can use Charset.defaultCharset())
tmpdirectory - location of the temporary files (set to null for default location)
distinct - Pass true if duplicate lines should be discarded.
IOException - generic IO exception
public static List<File> sortInBatch(File file, Comparator<String> cmp, Charset cs, File tmpdirectory, boolean distinct, int numHeader) throws IOException
file - some flat file
cmp - string comparator
cs - character set to use (can use Charset.defaultCharset())
tmpdirectory - location of the temporary files (set to null for default location)
distinct - Pass true if duplicate lines should be discarded.
numHeader - number of leading (header) lines to exclude from sorting
IOException - generic IO exception
public static List<File> sortInBatch(File file, Comparator<String> cmp, int maxtmpfiles, Charset cs, File tmpdirectory, boolean distinct, int numHeader) throws IOException
file - some flat file
cmp - string comparator
maxtmpfiles - maximal number of temporary files
cs - character set to use (can use Charset.defaultCharset())
tmpdirectory - location of the temporary files (set to null for default location)
distinct - Pass true if duplicate lines should be discarded.
numHeader - number of leading (header) lines to exclude from sorting
IOException - generic IO exception
public static List<File> sortInBatch(File file, Comparator<String> cmp, int maxtmpfiles, Charset cs, File tmpdirectory, boolean distinct, int numHeader, boolean usegzip) throws IOException
file - some flat file
cmp - string comparator
maxtmpfiles - maximal number of temporary files
cs - character set to use (can use Charset.defaultCharset())
tmpdirectory - location of the temporary files (set to null for default location)
distinct - Pass true if duplicate lines should be discarded.
numHeader - number of leading (header) lines to exclude from sorting
usegzip - use gzip compression for the temporary files
IOException - generic IO exception
public static List<File> sortInBatch(File file, Comparator<String> cmp, int maxtmpfiles, Charset cs, File tmpdirectory, boolean distinct, int numHeader, boolean usegzip, boolean parallel) throws IOException
file - some flat file
cmp - string comparator
maxtmpfiles - maximal number of temporary files
cs - character set to use (can use Charset.defaultCharset())
tmpdirectory - location of the temporary files (set to null for default location)
distinct - Pass true if duplicate lines should be discarded.
numHeader - number of leading (header) lines to exclude from sorting
usegzip - use gzip compression for the temporary files
parallel - whether to sort in parallel
IOException - generic IO exception
Copyright © 2017. All rights reserved.